Skip to content

Conversation

@zhouhao3
Copy link
Contributor

@zhouhao3 zhouhao3 commented Jan 25, 2022

Avoid the situation that ironic-agent.service will not restart after failure due to some circumstances.

Signed-off-by: Zhou Hao [email protected]

@openshift-ci openshift-ci bot requested review from andfasano and sadasu January 25, 2022 03:15
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 25, 2022

Hi @zhouhao3. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 25, 2022
@hs0210
Copy link

hs0210 commented Jan 26, 2022

IMO adding Restart=on-failure is a code improvement, not a bug fix.

When testing IPI with #33 on iRMC, ironic-agent sometimes failed to start successfully, but after a manual restart, the service ran successfully. We noticed that the error was related to the network. Unfortunately, we didn't record the error message at that time. Now we've done 6 tests in a row but can't reproduce the error. We guess that when the service was pulling the image, the proxy was just unavailable, causing the service to fail to start.

Therefore, if the service itself supports restart after failure, it can avoid permanent failure of the service due to factors such as network instability.

@dtantsur
Copy link
Member

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 28, 2022
@hs0210
Copy link

hs0210 commented Jan 28, 2022

The error reappeared.

-- Logs begin at Fri 2022-01-28 23:33:55 UTC, end at Fri 2022-01-28 23:38:23 UTC. --
Jan 28 23:34:41 master-1 systemd[1]: Starting Ironic Agent...
Jan 28 23:34:42 master-1 podman[2243]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b9f235258d1bf30f5819089506e1ee5b698dcbe1b52ce9e49cbaad6ad96c1897...
Jan 28 23:34:46 master-1 podman[2243]: Getting image source signatures
Jan 28 23:34:50 master-1 podman[2243]: Error: Error reading blob sha256:eac1b95df832dc9f172fd1f07e7cb50c1929b118a4249ddd02c6318a677b506a: Get "https://cdn02.quay.io/sha256/ea/eac1b95df832dc9f172fd1f07e7cb50c1929b118a4249ddd02c6318a677b506a?Expires=1643413370&Signature=Tix43Y0YVkJaSgjvc6stWUiASYCRxEc7C-JSNge-R4zbwrS9bhgNNIMJDBpy4lbNj-r7KJJw7q~rmO8xiI47sYJkI1zLRAFFijTFDVCKbNGxaYUBfUvrC0BEcCdLTosgkS8m2Nv1BOqjQcZI61B2ChjLMtYFzmwllSOMga9UV9A2iZJDTi9pSjLPaFl-qam1W57FJQIh107gfA3lo7NounE337WKt6rQWSVgFfvFe0G~Sjs3yMknHcCvjuiKlF86F05RJAD0VBPMOCnDqYHLHOA~-6d-ZmSmBqK307Ab2~Bic~eqahQKxewWuRK70AWtG24LLKlDeGBumH006nTKSw__&Key-Pair-Id=APKAJ67PQLWGCSP66DGA": ERROR
Jan 28 23:34:50 master-1 systemd[1]: ironic-agent.service: Control process exited, code=exited status=125
Jan 28 23:34:50 master-1 systemd[1]: ironic-agent.service: Failed with result 'exit-code'.
Jan 28 23:34:50 master-1 systemd[1]: Failed to start Ironic Agent.

@fenggw-fnst
Copy link

It happened again:

-- Logs begin at Sat 2022-01-29 06:31:32 UTC, end at Sat 2022-01-29 07:46:48 UTC. --
Jan 29 06:32:19 worker-0 systemd[1]: Starting Ironic Agent...
Jan 29 06:32:20 worker-0 podman[2277]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f07e1596763e4f1b0834cb700037fd1958071ae26492d8c3fc686afc5ef560cb...
Jan 29 06:32:26 worker-0 podman[2277]: Error: Error parsing image configuration: Get "https://cdn02.quay.io/sha256/db/db1135106ca881540040a27ecd1e8d3af97eff3881095ea9a48a1f5b1d1e407d?Expires=1643438266&Signature=NFPoGINK-IW0lgieHhLa~G-9m462emXWjEvcXqMJhMAJSNPZWrA4ywibhgxguUYy60GZFRVpZcZsm0OQ3aRhLkbTrRPCgv6H3ndmCV-kpHgvqajCm-QZM5PBI~pdnNajxyLPKqfVEZq9cntOE3CidbERlUNT~92P4cL2Zlp~nE1JHvFS~Y5BkpvscQWshEUEBlc~ZmAPO1QF9CSGbLlB7TFsNg22M65rpikNqHszyk9N5STuq6p5SqrRTGs78JUm-AIEDq~CZWfJVcHTqliCUxfMvSJEsl-wLIfcz2NfQIhctDHwrmGVEWdKjMs1Y7T4UkHGb5IEpBoarp~nApvSkw__&Key-Pair-Id=APKAJ67PQLWGCSP66DGA": ERROR
Jan 29 06:32:26 worker-0 systemd[1]: ironic-agent.service: Control process exited, code=exited status=125
Jan 29 06:32:26 worker-0 systemd[1]: ironic-agent.service: Failed with result 'exit-code'.
Jan 29 06:32:26 worker-0 systemd[1]: Failed to start Ironic Agent.

@andfasano
Copy link
Contributor

In case of non-transient failure, with this approach isn't the service restarted continuosly?

@dtantsur
Copy link
Member

Restarting the service matches the behavior of our previous ramdisk, so it should be fine.

Please fix the unit test failure.

@andfasano
Copy link
Contributor

Restarting the service matches the behavior of our previous ramdisk, so it should be fine.

Yes, I was thinking more to the fact that in case of permanent failure (ie network outage) the installation will probably fail after the bootstrap timeout (30m), while a mechanism based on a counter could fail faster.

@dtantsur
Copy link
Member

I don't think we have a way to communicate the agent failure back to the control plane.

@andfasano
Copy link
Contributor

I don't think we have a way to communicate the agent failure back to the control plane.

Right, not sure if at least the error reporting could be improved then ( I guess @hs0210 fetched the journal log directly from within the instance?)

@zhouhao3 zhouhao3 force-pushed the add-restart-for-service branch from 71d5d4e to 02d1bec Compare February 7, 2022 01:37
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2022
@zhouhao3 zhouhao3 force-pushed the add-restart-for-service branch 2 times, most recently from d46a421 to 6a48405 Compare February 7, 2022 01:46
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2022
@zhouhao3
Copy link
Contributor Author

zhouhao3 commented Feb 7, 2022

/retest

@zhouhao3
Copy link
Contributor Author

zhouhao3 commented Feb 7, 2022

@dtantsur @andfasano updated, PTAL.

@zhouhao3
Copy link
Contributor Author

zhouhao3 commented Feb 7, 2022

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 7, 2022

@zhouhao3: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@zhouhao3
Copy link
Contributor Author

Right, not sure if at least the error reporting could be improved then ( I guess @hs0210 fetched the journal log directly from within the instance?)

We can log in to the master to get the journal log, but we haven't thought of a good way to improve the error report.

@dtantsur @andfasano When we use IPI deployment, we still encounter this problem from time to time. Can you review this PR? thanks.

@dtantsur
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2022
@andfasano
Copy link
Contributor

/approve

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 14, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano, zhouhao3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2022
@openshift-merge-robot openshift-merge-robot merged commit 5c400b9 into openshift:main Feb 14, 2022
@sshnaidm
Copy link

Can we port it to 4.10 as well?

@elfosardo
Copy link
Contributor

/cherry-pick release-4.10

@openshift-cherrypick-robot

@elfosardo: new pull request created: #52

Details

In response to this:

/cherry-pick release-4.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants